Skip to content

Conversation

amd-hhashemi
Copy link

@amd-hhashemi amd-hhashemi commented Jul 29, 2025

Associated JIRA ticket number/Github issue number

Fixes SWDEV-546808

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Continuous Integration

What were the changes?

match_any() showing suboptimal perf. This change converts it to a sequence of wave_rot1 dpp ops.

Why are these changes needed?

Prior solution uses inefficient firstlane in loop.

Updated CHANGELOG?

  • Yes
  • No, Does not apply to this PR.

Added/Updated documentation?

  • Yes
  • No, Does not apply to this PR.

Additional Checks

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally.
  • Any dependent changes have been merged.

retval = 1;
for (int i = 1; i < static_cast<int>(warpSize); i++) {
if constexpr (sizeof(value) == 8)
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the comment "wave_rol1" mean? Also, the magic constants need comments like (/* foo= */ 0x134). Or even better would be const unsigned int meaningfulName = 0x134.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's just an external personal log. Having meaningful names (with comments if necessary) at the top of this file will be much better. Could be const integers, or could be an enum class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments added in new check-in. Let me know if it makes it clearer.

dill_.i[0] = __builtin_amdgcn_mov_dpp(dill_.i[0], 0x134, 0xf, 0xf, 0); //wave_rol1
retval |= ((decltype(mask))(dill_.val == value)) << i;
}
int rotv = __lane_id();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does rotv mean? More comments will be useful for future reference, like what is the overall strategy here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the rotate amount. I added comment to code.

retval = 1;
for (int i = 1; i < static_cast<int>(warpSize); i++) {
if constexpr (sizeof(value) == 8)
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134, 0xf, 0xf, 0); //wave_rol1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to read up on this, but just to be sure, do these _dpp intrinsics work with both wave64 and wave32?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why it wouldn't work for wave32. That is why I put the rotate amount as static_cast(warpSize).

}
}
} else {
union dill { unsigned int i[2]; unsigned long long ill; decltype(value) val; } dill_ = { .val = value };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we push this union declaration into the if constexpr, then we will only need one integer member with the correct width.

Copy link
Author

@amd-hhashemi amd-hhashemi Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't match_any() need to handle int and long?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was kinda assuming the compiler will optimize out the upper bits since the whole function is a template anyway. But this works too.

yxsamliu pushed a commit to yxsamliu/clr that referenced this pull request Aug 5, 2025
* SWDEV-460151 - add gfx1201 to amd-staging clr

* SWDEV-460151 - removed pal macro

---------

Co-authored-by: Jimbo Xie <[email protected]>
@amd-hhashemi amd-hhashemi changed the title optimize match_any() with dpp wave_rot1 SWDEV-546808 optimize match_any() with dpp wave_rot1 Aug 9, 2025
Copy link
Contributor

@ssahasra ssahasra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big problem is that I don't know what the DPP_CTRL value 0x134 means. It's not obvious from reading the SPG or the ISA manual. I think someone more familiar with DPP needs to review that this always works as expected, and on all supported architectures.

} else {
union dill { unsigned int i[2]; unsigned long long ill; decltype(value) val; } dill_ = { .val = value };
retval = 1;
//Do a full rotate of the wave lanes, using dpp with "wave_rol1" control (ID: 0x134).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//Do a full rotate of the wave lanes, using dpp with "wave_rol1" control (ID: 0x134).
// Do a full rotate of the wave lanes, using dpp with "wave_rol1" control (ID: 0x134).

Please add a space after the // for every comment line. I don't know if HIP has yet adopted a coding style, but this kind of whitespace is very commonly considered correct.

//In doing so each lane gets a mask of matches with all other lanes in the wave in retval.
for (int i = 1; i < static_cast<int>(warpSize); i++) {
if constexpr (sizeof(value) == 8)
dill_.ill = __builtin_amdgcn_mov_dpp(dill_.ill, 0x134 /*dpp_ctrl=wave_rol1*/, 0xf/*row_mask*/, 0xf/*clmn_mask*/, 1/*bound_ctrl*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments are partially helpful, but I am still not okay with magic constants appearing somewhere in code. Those constants like 0x134 need to be declared in one suitable place, maybe as an enum or as constant integers with meaningful names.

@rocm-devops
Copy link
Contributor

Imported to ROCm/rocm-systems

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants